Unsupervised Learning: Trade & Ahead¶

Problem Statement¶

Context¶

The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: doing so can help fight inflation, build wealth, and provide some tax benefits. Good, steady returns compounded over a long period can also grow far beyond what the contributions alone suggest, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is easy to get lost in a sea of financial metrics when determining the worth of a stock, and doing so for a multitude of stocks to identify the right picks can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics and stocks that exhibit minimal correlation with each other. This helps investors analyze stocks across different market segments and guard against risks that could leave the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price over the past 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Estimated number of the company's shares currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
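Several of the per-share metrics in the dictionary are simple ratios of one another. A toy worked example (made-up numbers, not taken from the dataset) showing how EPS, P/E, and P/B relate:

```python
# Hedged illustration with toy figures (not from the dataset): how the
# per-share metrics in the data dictionary relate to each other.
net_income = 5_000_000_000          # dollars
shares_outstanding = 1_000_000_000  # estimated shares outstanding
current_price = 60.0                # dollars per share
book_value_per_share = 12.0         # (total assets - total liabilities) / shares

eps = net_income / shares_outstanding            # Earnings Per Share
pe_ratio = current_price / eps                   # P/E Ratio
pb_ratio = current_price / book_value_per_share  # P/B Ratio

print(eps, pe_ratio, pb_ratio)  # 5.0 12.0 5.0
```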

Importing necessary libraries and data¶

In [1]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user

# I will use the latest version of the library as that will best prepare me for a job in the data science field.
In [2]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2
In [3]:
# Reading, manipulating, and visualizing data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scaling Data
from sklearn.preprocessing import StandardScaler

# Distances between clusters, k-means, and silhouette
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Hierarchical clustering, cophenetic correlation, and dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [4]:
# mount my google drive
from google.colab import drive
drive.mount('/content/drive')

# read the csv file to a data frame
df=pd.read_csv('/content/drive/MyDrive/Python Foundations - GreatLearning/stock_data.csv')
data=df.copy()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview¶

  • Observations
  • Sanity checks
In [5]:
# Get an idea what the data looks like
data.head()
Out[5]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810
In [6]:
data.tail()
Out[6]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
335 YHOO Yahoo Inc. Information Technology Internet Software & Services 33.259998 14.887727 1.845149 15 459 -1032187000 -4359082000 -4.64 939457327.6 28.976191 6.261775
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 435353535.4 17.682214 -3.838260
337 ZBH Zimmer Biomet Holdings Health Care Health Care Equipment 102.589996 9.347683 1.404206 1 100 376000000 147000000 0.78 188461538.5 131.525636 -23.884449
338 ZION Zions Bancorp Financials Regional Banks 27.299999 -1.158588 1.468176 4 99 -43623000 309471000 1.20 257892500.0 22.749999 -0.063096
339 ZTS Zoetis Health Care Pharmaceuticals 47.919998 16.678836 1.610285 32 65 272000000 339000000 0.68 498529411.8 70.470585 1.723068
In [7]:
# see how many rows and columns there are
data.shape
Out[7]:
(340, 15)
In [8]:
# figure out the data types of each column
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
In [9]:
# check if any rows are repeated
data.duplicated().sum()
Out[9]:
0
In [10]:
# check the percentage breakdown of GICS sectors
data['GICS Sector'].value_counts(normalize=True)
Out[10]:
proportion
GICS Sector
Industrials 0.155882
Financials 0.144118
Health Care 0.117647
Consumer Discretionary 0.117647
Information Technology 0.097059
Energy 0.088235
Real Estate 0.079412
Utilities 0.070588
Materials 0.058824
Consumer Staples 0.055882
Telecommunications Services 0.014706

We can see that some sectors are more represented than others. This may or may not be reflected in the clusters later.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.
In [11]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    ) # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [12]:
# create a subset of the data frame that includes only numeric columns
data_numeric = data.select_dtypes(include=[np.number])

# loop over the numeric columns, creating a histogram and boxplot for each
for feature in data_numeric.columns:
    histogram_boxplot(data_numeric, feature, figsize=(12, 7), kde=False, bins=None)

Nearly every column in this data set has outliers, many of them extreme. Because we are only grouping the data into similar clusters, not predicting a target variable, this should be fine. If imputation were needed, the median would be used, since outliers could skew the mean; that said, this data set has no missing values.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

EDA Question 1¶

What does the distribution of stock prices look like?

In [13]:
# use the histogram_boxplot function to show current price distribution
histogram_boxplot(data, 'Current Price', figsize=(12, 7), kde=True, bins=None)

We can see the data is extremely right-skewed, indicating that some stocks trade at much higher prices than the rest. This does not mean those companies are more valuable; they may simply have fewer shares outstanding at a higher price each.
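The degree of right skew can be quantified with the sample skewness: a positive value confirms a right-skewed distribution. A hedged sketch on synthetic stand-in data (the real input would be `data['Current Price']`):

```python
import numpy as np
import pandas as pd

# Hedged sketch: quantifying right skew with pandas' sample skewness.
# `prices` is synthetic lognormal stand-in data, not the actual
# Current Price column.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=3.5, sigma=0.8, size=340))

print(prices.skew())  # positive value => right (positive) skew
```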

EDA Question 2¶

The stocks of which economic sector have seen the maximum price increase on average?

In [14]:
# calculate the mean price change by GICS sector
data.groupby('GICS Sector')['Price Change'].mean().sort_values(ascending=False)
Out[14]:
Price Change
GICS Sector
Health Care 9.585652
Consumer Staples 8.684750
Information Technology 7.217476
Telecommunications Services 6.956980
Real Estate 6.205548
Consumer Discretionary 5.846093
Materials 5.589738
Financials 3.865406
Industrials 2.833127
Utilities 0.803657
Energy -10.228289

We can see the Health Care sector has done particularly well and the Energy sector has done particularly poorly.

EDA Question 3¶

How are the different variables correlated with each other?

In [15]:
# make a heatmap that shows how each column in the dataframe correlates with each other column in the dataframe
sns.heatmap(data_numeric.corr(), annot=True, cmap="coolwarm")
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()

We can see that the correlations are relatively low overall: every pairwise correlation has a magnitude below 0.6. The two most correlated variables are Net Income and Estimated Shares Outstanding.
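Reading the strongest pair off the heatmap can also be done programmatically by masking the diagonal and one triangle of the correlation matrix. A sketch on a small synthetic frame (the real input would be `data_numeric.corr()`):

```python
import numpy as np
import pandas as pd

# Hedged sketch: extracting the most correlated column pair from a
# correlation matrix. `df` is a small synthetic frame in which x and y
# are constructed to be strongly correlated.
rng = np.random.default_rng(42)
a = rng.normal(size=200)
df = pd.DataFrame({
    "x": a,
    "y": a + rng.normal(scale=0.5, size=200),
    "z": rng.normal(size=200),
})

corr = df.corr()
# Keep only the strict upper triangle, then find the largest magnitude.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pair = upper.stack().abs().idxmax()
print(pair, upper.stack().loc[pair])  # ('x', 'y') and a correlation near 0.9
```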

EDA Question 4¶

Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?

In [16]:
# show the mean cash ratio grouped by GICS sector
data.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
Out[16]:
Cash Ratio
GICS Sector
Information Technology 149.818182
Telecommunications Services 117.000000
Health Care 103.775000
Financials 98.591837
Consumer Staples 70.947368
Energy 51.133333
Real Estate 50.111111
Consumer Discretionary 49.575000
Materials 41.700000
Industrials 36.188679
Utilities 13.625000

We can see that Information Technology has the most cash on hand to cover its short-term obligations, while Utilities has much less cash on hand than the other sectors.

EDA Question 5¶

P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

In [17]:
# show mean P/E ratios grouped by GICS sector
data.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
Out[17]:
P/E Ratio
GICS Sector
Energy 72.897709
Information Technology 43.782546
Real Estate 43.065585
Health Care 41.135272
Consumer Discretionary 35.211613
Consumer Staples 25.521195
Materials 24.585352
Utilities 18.719412
Industrials 18.259380
Financials 16.023151
Telecommunications Services 12.222578

This is interesting: despite having the largest average price decline (see EDA Question 2), Energy has the highest average P/E ratio, meaning investors are paying the most per dollar of earnings. This may partly reflect shareholder confidence, but a high P/E can also arise when the earnings in the denominator fall faster than the price, so this figure should be read with caution for the Energy sector.

Data Preprocessing¶

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Feature engineering (if needed)
  • Any other preprocessing steps (if needed)

Duplicate Value Check¶

In [18]:
# duplicate value check
data.duplicated().sum()
Out[18]:
0

Missing Value Treatment¶

In [19]:
# missing value treatment
# check for missing values
data.isna().sum()
Out[19]:
0
Ticker Symbol 0
Security 0
GICS Sector 0
GICS Sub Industry 0
Current Price 0
Price Change 0
Volatility 0
ROE 0
Cash Ratio 0
Net Cash Flow 0
Net Income 0
Earnings Per Share 0
Estimated Shares Outstanding 0
P/E Ratio 0
P/B Ratio 0

Outlier Treatment¶

We observed during the EDA that there are outliers in almost every numeric column, some of them quite extreme. There aren't many rows in our dataframe, only 340. While these outliers could affect the shape of our clusters, removing them could have a significant impact as well, and we have no reason to believe the outlier data is erroneous. We will leave the outliers in.
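For reference, the usual 1.5×IQR rule can be used to count (rather than remove) outliers per column. A hedged sketch on synthetic stand-in data, not the actual dataframe:

```python
import numpy as np
import pandas as pd

# Hedged sketch of the 1.5*IQR rule, counting outliers per column.
# `df` is synthetic stand-in data: a roughly symmetric column and a
# heavy-tailed one that should produce flagged values.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Volatility": rng.normal(1.5, 0.3, 340),
    "Net Income": rng.lognormal(20, 1.5, 340),
})

q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
# A value is flagged if it lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers)  # per-column outlier counts
```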

Scaling¶

In [20]:
# scaling should be done before encoding dummy variables: the variance and standard deviation of binary category indicators are not meaningful to standardize
# Scale the numeric columns in the original data frame
scale_data = data.copy()
scaler = StandardScaler()
scale_data[data_numeric.columns] = scaler.fit_transform(scale_data[data_numeric.columns])
In [21]:
scale_data.head()
Out[21]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 AAL American Airlines Group Industrials Airlines -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 ABBV AbbVie Health Care Pharmaceuticals -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 ABT Abbott Laboratories Health Care Health Care Equipment -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 ADBE Adobe Systems Inc Information Technology Application Software 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 ADI Analog Devices, Inc. Information Technology Semiconductors -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

It appears that our data has been correctly scaled. Next, we need to remove the Ticker Symbol column and the Security column. These columns essentially work as names and do not add any information to our clusters.
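A quick sanity check that standardization worked: every scaled column should have mean ≈ 0 and (population) standard deviation ≈ 1. A sketch on synthetic data; the real check would use `scale_data[data_numeric.columns]`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hedged sanity check: StandardScaler should leave each column with
# mean ~0 and standard deviation ~1. X is synthetic stand-in data with
# two columns on very different scales.
rng = np.random.default_rng(7)
X = rng.normal(loc=[50, -3], scale=[10, 2], size=(340, 2))

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # ~[1, 1]
```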

Remove Unneeded Columns¶

In [22]:
# remove Ticker Symbol and Security columns from the data frame
scale_data.drop(['Ticker Symbol', 'Security'], axis=1, inplace=True)
In [23]:
scale_data.head()
Out[23]:
GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 Industrials Airlines -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 Health Care Pharmaceuticals -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 Health Care Health Care Equipment -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 Information Technology Application Software 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 Information Technology Semiconductors -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

Dummy Variables and Difference between data1 and data2¶

Because there are so many unique values in GICS Sector and GICS Sub Industry, I am going to move forward with two new data sets as described below:

  • data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be curse of dimensionality as we are increasing the number of columns from 13 to 124 even with drop_first=True

  • data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful but will not suffer from the curse of dimensionality at all

In [24]:
data1=scale_data.copy()
data2=scale_data.copy()
In [25]:
# for data1, encode dummy variables for GICS Sector and GICS Sub Industry
data1 = pd.get_dummies(data1, columns=['GICS Sector', 'GICS Sub Industry'], drop_first=True)

# for each column with data type bool in data1, map "True" to "1" and map "False" to "0" and change type to int
for col in data1.columns:
    if data1[col].dtype == 'bool':
        data1[col] = data1[col].map({True: 1, False: 0})
        data1[col] = data1[col].astype(int)

# for data2, drop GICS Sector and GICS Sub Industry
data2.drop(['GICS Sector', 'GICS Sub Industry'], axis=1, inplace=True)
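As a side note, recent pandas versions let get_dummies emit integer columns directly via its dtype parameter, which would replace the bool-to-int loop. A minimal sketch on a toy frame:

```python
import pandas as pd

# Hedged alternative to the bool->int conversion loop: pass dtype=int to
# get_dummies directly. `toy` is a small illustrative frame.
toy = pd.DataFrame({"GICS Sector": ["Energy", "Utilities", "Energy"]})
dummies = pd.get_dummies(toy, columns=["GICS Sector"], drop_first=True, dtype=int)
print(dummies)  # one 0/1 integer column, "GICS Sector_Utilities"
```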
In [26]:
data1.head()
Out[26]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio GICS Sector_Consumer Staples GICS Sector_Energy GICS Sector_Financials GICS Sector_Health Care GICS Sector_Industrials GICS Sector_Information Technology GICS Sector_Materials GICS Sector_Real Estate GICS Sector_Telecommunications Services GICS Sector_Utilities GICS Sub Industry_Aerospace & Defense GICS Sub Industry_Agricultural Products GICS Sub Industry_Air Freight & Logistics GICS Sub Industry_Airlines GICS Sub Industry_Alternative Carriers GICS Sub Industry_Apparel, Accessories & Luxury Goods GICS Sub Industry_Application Software GICS Sub Industry_Asset Management & Custody Banks GICS Sub Industry_Auto Parts & Equipment GICS Sub Industry_Automobile Manufacturers GICS Sub Industry_Banks GICS Sub Industry_Biotechnology GICS Sub Industry_Brewers GICS Sub Industry_Broadcasting & Cable TV GICS Sub Industry_Building Products GICS Sub Industry_Cable & Satellite GICS Sub Industry_Casinos & Gaming GICS Sub Industry_Computer Hardware GICS Sub Industry_Construction & Farm Machinery & Heavy Trucks GICS Sub Industry_Construction Materials GICS Sub Industry_Consumer Electronics GICS Sub Industry_Consumer Finance GICS Sub Industry_Copper GICS Sub Industry_Data Processing & Outsourced Services GICS Sub Industry_Distributors GICS Sub Industry_Diversified Chemicals GICS Sub Industry_Diversified Commercial Services GICS Sub Industry_Diversified Financial Services GICS Sub Industry_Drug Retail GICS Sub Industry_Electric Utilities GICS Sub Industry_Electrical Components & Equipment GICS Sub Industry_Electronic Components GICS Sub Industry_Electronic Equipment & Instruments GICS Sub Industry_Environmental Services GICS Sub Industry_Fertilizers & Agricultural Chemicals GICS Sub Industry_Financial Exchanges & Data GICS Sub Industry_Gold GICS Sub Industry_Health Care Distributors GICS Sub Industry_Health Care Equipment GICS Sub 
Industry_Health Care Facilities GICS Sub Industry_Health Care Supplies GICS Sub Industry_Home Entertainment Software GICS Sub Industry_Home Furnishings GICS Sub Industry_Homebuilding GICS Sub Industry_Hotels, Resorts & Cruise Lines GICS Sub Industry_Household Appliances GICS Sub Industry_Household Products GICS Sub Industry_Housewares & Specialties GICS Sub Industry_Human Resource & Employment Services GICS Sub Industry_IT Consulting & Other Services GICS Sub Industry_Industrial Conglomerates GICS Sub Industry_Industrial Gases GICS Sub Industry_Industrial Machinery GICS Sub Industry_Industrial Materials GICS Sub Industry_Insurance Brokers GICS Sub Industry_Integrated Oil & Gas GICS Sub Industry_Integrated Telecommunications Services GICS Sub Industry_Internet & Direct Marketing Retail GICS Sub Industry_Internet Software & Services GICS Sub Industry_Investment Banking & Brokerage GICS Sub Industry_Leisure Products GICS Sub Industry_Life & Health Insurance GICS Sub Industry_Life Sciences Tools & Services GICS Sub Industry_Managed Health Care GICS Sub Industry_Metal & Glass Containers GICS Sub Industry_Motorcycle Manufacturers GICS Sub Industry_Multi-Sector Holdings GICS Sub Industry_Multi-line Insurance GICS Sub Industry_MultiUtilities GICS Sub Industry_Networking Equipment GICS Sub Industry_Office REITs GICS Sub Industry_Oil & Gas Equipment & Services GICS Sub Industry_Oil & Gas Exploration & Production GICS Sub Industry_Oil & Gas Refining & Marketing & Transportation GICS Sub Industry_Packaged Foods & Meats GICS Sub Industry_Paper Packaging GICS Sub Industry_Personal Products GICS Sub Industry_Pharmaceuticals GICS Sub Industry_Property & Casualty Insurance GICS Sub Industry_Publishing GICS Sub Industry_REITs GICS Sub Industry_Railroads GICS Sub Industry_Real Estate Services GICS Sub Industry_Regional Banks GICS Sub Industry_Research & Consulting Services GICS Sub Industry_Residential REITs GICS Sub Industry_Restaurants GICS Sub Industry_Retail REITs GICS Sub 
Industry_Semiconductor Equipment GICS Sub Industry_Semiconductors GICS Sub Industry_Soft Drinks GICS Sub Industry_Specialized REITs GICS Sub Industry_Specialty Chemicals GICS Sub Industry_Specialty Retail GICS Sub Industry_Specialty Stores GICS Sub Industry_Steel GICS Sub Industry_Technology Hardware, Storage & Peripherals GICS Sub Industry_Technology, Hardware, Software and Supplies GICS Sub Industry_Thrifts & Mortgage Finance GICS Sub Industry_Tires & Rubber GICS Sub Industry_Tobacco GICS Sub Industry_Trucking GICS Sub Industry_Water Utilities
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Secondary EDA Note¶

Because standardization is a linear transformation and no outliers were removed, the shape of every distribution is unchanged after scaling: all the graphs would look identical except that the axis units would be less readable. Secondary EDA is not necessary in this case.

K-Means Clustering on data1¶

As a reminder: data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be curse of dimensionality as we are increasing the number of columns from 13 to 124 even with drop_first=True

In [27]:
# make a copy of data1 so it remains unchanged for hierarchical clustering later
k_means_df1 = data1.copy()

Elbow Score for data1¶

In [28]:
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(k_means_df1)
    distortion = (
        sum(np.min(cdist(k_means_df1, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df1.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.964689739825615
Number of Clusters: 2 	Average Distortion: 2.8683405676145317
Number of Clusters: 3 	Average Distortion: 2.803663383067996
Number of Clusters: 4 	Average Distortion: 2.6549208303886864
Number of Clusters: 5 	Average Distortion: 2.5709688021715866
Number of Clusters: 6 	Average Distortion: 2.5215253702526255
Number of Clusters: 7 	Average Distortion: 2.4627896099777917
Number of Clusters: 8 	Average Distortion: 2.43557403792719
Number of Clusters: 9 	Average Distortion: 2.3905822333386255
Number of Clusters: 10 	Average Distortion: 2.3356012840014118
Number of Clusters: 11 	Average Distortion: 2.2955311559289453
Number of Clusters: 12 	Average Distortion: 2.253550018031723
Number of Clusters: 13 	Average Distortion: 2.2073357384639367
Number of Clusters: 14 	Average Distortion: 2.1815365543372605
In [29]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df1)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
Out[29]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

Silhouette Scores for data1¶

In [30]:
# silhouette scores
silhouette_scores = []
cluster_list = range(2, 15)
for k in cluster_list:
    kmeans = KMeans(n_clusters=k, random_state=1)
    predictions = kmeans.fit_predict(k_means_df1)
    score = silhouette_score(k_means_df1, predictions)
    silhouette_scores.append(score)
    print("Number of Clusters:", k, "\tAverage Silhouette Score:", score)

plt.plot(cluster_list, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Average Silhouette Score")
plt.title("Selecting k with the Silhouette Method", fontsize=20)
plt.show()
Number of Clusters: 2 	Average Silhouette Score: 0.3077609091830765
Number of Clusters: 3 	Average Silhouette Score: 0.3183211990235526
Number of Clusters: 4 	Average Silhouette Score: 0.34517209690198636
Number of Clusters: 5 	Average Silhouette Score: 0.317171879589086
Number of Clusters: 6 	Average Silhouette Score: 0.321540601063231
Number of Clusters: 7 	Average Silhouette Score: 0.33899989921610313
Number of Clusters: 8 	Average Silhouette Score: 0.30706456773522595
Number of Clusters: 9 	Average Silhouette Score: 0.31159194506849025
Number of Clusters: 10 	Average Silhouette Score: 0.08598883424124792
Number of Clusters: 11 	Average Silhouette Score: 0.08107759543523096
Number of Clusters: 12 	Average Silhouette Score: 0.08805038973278116
Number of Clusters: 13 	Average Silhouette Score: 0.09375702541346734
Number of Clusters: 14 	Average Silhouette Score: 0.09578160880917934
In [31]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2,15), metric="silhouette", timings=True)
visualizer.fit(k_means_df1)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
Out[31]:
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
In [32]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=4, random_state=1))
visualizer.fit(k_means_df1)
visualizer.show()
Out[32]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [33]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df1)
visualizer.show()
Out[33]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

The cases for four and six clusters are both plausible, but in both the overall grouping is weak: the average silhouette score is below 0.35 for each option, and each has a cluster whose silhouette coefficients are more negative than positive. Since clustering quality matters slightly more here than in-cluster variance, I am inclined to lean towards the 4-cluster grouping.
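Beyond the average scores printed above, per-sample silhouette values make it easy to see which specific clusters are weak. A minimal sketch on synthetic data (an assumption; in the notebook, `k_means_df1` would be passed in place of the generated `X`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled notebook frame k_means_df1.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Per-sample silhouette coefficients summarised per cluster: a cluster whose
# mean is near zero, or whose minimum is strongly negative, is weakly separated.
sil = silhouette_samples(X, labels)
for k in np.unique(labels):
    members = sil[labels == k]
    print(f"cluster {k}: n={members.size}  mean={members.mean():.3f}  min={members.min():.3f}")
```

A cluster with a strongly negative minimum (as seen in both the 4- and 6-cluster options above) contains points that sit closer to a neighbouring cluster than to their own.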

Create Final Model for data1¶

In [34]:
# final kmeans model for data1
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df1)
Out[34]:
KMeans(n_clusters=4, random_state=1)
In [35]:
# create a fresh copy of the original data frame
dfkm1 = data.copy()  # name: data frame, kmeans, data1

# add the kmeans cluster labels to this unscaled copy and to the scaled data frame k_means_df1
dfkm1['Cluster'] = kmeans.labels_
k_means_df1['Cluster'] = kmeans.labels_

Cluster Profiling for data1¶

In [36]:
# cluster profiling
km_cluster_profile_1 = dfkm1.groupby('Cluster').mean(numeric_only=True)
In [37]:
km_cluster_profile_1.head()
Out[37]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster
0 75.355204 -7.350601 2.540224 35.975610 148.268293 2.124745e+08 -1.161346e+09 -2.648293 5.918493e+08 92.900103 2.609334
1 83.392891 5.848839 1.376700 34.515789 58.673684 8.323276e+07 1.562137e+09 3.951316 4.322491e+08 24.217338 -2.266772
2 26.990000 -14.060688 3.296307 603.000000 57.333333 -5.850000e+08 -1.755567e+10 -39.726667 4.819101e+08 71.528835 1.638633
3 50.517273 5.747586 1.130399 31.090909 75.909091 -1.072273e+09 1.483309e+10 4.154545 4.298827e+09 14.803577 -4.552119
In [38]:
km_cluster_profile_1["Count of Companies in this Cluster"] = (dfkm1.groupby('Cluster')["Security"].count().values)
In [39]:
km_cluster_profile_1.head()
Out[39]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster
0 75.355204 -7.350601 2.540224 35.975610 148.268293 2.124745e+08 -1.161346e+09 -2.648293 5.918493e+08 92.900103 2.609334 41
1 83.392891 5.848839 1.376700 34.515789 58.673684 8.323276e+07 1.562137e+09 3.951316 4.322491e+08 24.217338 -2.266772 285
2 26.990000 -14.060688 3.296307 603.000000 57.333333 -5.850000e+08 -1.755567e+10 -39.726667 4.819101e+08 71.528835 1.638633 3
3 50.517273 5.747586 1.130399 31.090909 75.909091 -1.072273e+09 1.483309e+10 4.154545 4.298827e+09 14.803577 -4.552119 11
In [40]:
# print which companies are in each cluster
for cluster in dfkm1['Cluster'].unique():
    print(f'Cluster {cluster}')
    print(dfkm1[dfkm1['Cluster'] == cluster]['Security'].unique())
    print('')
Cluster 1
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Alliance Data Systems'
 'Ameren Corp' 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Albemarle Corp'
 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc'
 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Bank of America Corp' 'Baxter International Inc.' 'BB&T Corporation'
 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.'
 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.'
 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation'
 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies'
 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions'
 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources'
 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.'
 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.'
 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equinix'
 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.'
 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences'
 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage'
 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics'
 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors'
 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.'
 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services'
 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty'
 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.'
 'Laboratory Corp. of America Holding' 'LKQ Corporation'
 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.'
 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines'
 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.'
 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.'
 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.'
 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group'
 'NextEra Energy' 'Nielsen Holdings' 'Norfolk Southern Corp.'
 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands'
 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive"
 "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.'
 'Priceline.com Inc' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal'
 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron'
 'Robert Half International' 'Roper Industries' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company'
 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor'
 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods'
 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments'
 'Under Armour' 'United Continental Holdings' 'UDR Inc'
 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group'
 'Union Pacific' 'United Parcel Service' 'United Technologies'
 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials'
 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc'
 'Waters Corporation' 'Wec Energy Group Inc' 'Whirlpool Corp.'
 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.'
 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona'
 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings'
 'Zions Bancorp' 'Zoetis']

Cluster 0
['Analog Devices, Inc.' 'Akamai Technologies Inc'
 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Anadarko Petroleum Corp'
 'Baker Hughes Inc' 'Celgene Corp.' 'Chipotle Mexican Grill'
 'Cabot Oil & Gas' 'Concho Resources' 'EOG Resources' 'EQT Corporation'
 'Facebook' 'Freeport-McMoran Cp & Gld' 'Frontier Communications'
 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise'
 'Kinder Morgan' 'Monster Beverage' 'The Mosaic Company'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc'
 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.'
 'Spectra Energy Corp.' 'Skyworks Solutions' 'Southwestern Energy'
 'Teradata Corp.' 'Vertex Pharmaceuticals Inc' 'Williams Cos.'
 'Wynn Resorts Ltd' 'Cimarex Energy' 'Yahoo Inc.']

Cluster 2
['Apache Corporation' 'Chesapeake Energy' 'Devon Energy Corp.']

Cluster 3
['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

In [41]:
dfkm1.groupby(["Cluster", "GICS Sector"])['Security'].count()
Out[41]:
Security
Cluster GICS Sector
0 Consumer Discretionary 3
Consumer Staples 1
Energy 21
Health Care 3
Industrials 1
Information Technology 8
Materials 3
Telecommunications Services 1
1 Consumer Discretionary 36
Consumer Staples 17
Energy 5
Financials 46
Health Care 35
Industrials 52
Information Technology 24
Materials 17
Real Estate 27
Telecommunications Services 2
Utilities 24
2 Energy 3
3 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 2
Information Technology 1
Telecommunications Services 2

In [42]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = dfkm1.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=dfkm1, x="Cluster", y=variable)

plt.tight_layout(pad=2.0)

Insights for KMeans for data1¶

  • Cluster 1 is the group containing the most outliers.
  • Clusters 0 and 2 have had negative price changes on average, whereas clusters 1 and 3 have had positive price changes on average.
  • Cluster 1 had the highest variance in Price Change.
  • Cluster 2 (the smallest cluster) has the highest Volatility and ROE, and the lowest Net Income and Earnings Per Share. All of these companies are in the Energy sector as well.
  • Cluster 3 (also small) had higher variance in Net Cash Flow, a lower P/E Ratio in both mean and variance, and higher Estimated Shares Outstanding and Net Income than the other clusters.
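The per-cluster spread claims in these bullets can be read off numerically rather than eyeballed from boxplots. A minimal sketch with a toy frame (an assumption; the notebook's `dfkm1` would be used in place of `toy`):

```python
import pandas as pd

# Toy stand-in for dfkm1: two numeric columns plus the cluster label.
toy = pd.DataFrame({
    "Cluster":      [0, 0, 1, 1, 1, 2, 2],
    "Price Change": [-7.0, -8.0, 6.0, 5.0, 7.0, -14.0, -13.0],
    "Net Income":   [-1.1, -1.2, 1.5, 1.6, 1.4, -17.5, -17.6],
})

# Mean and standard deviation per cluster make "highest variance" claims explicit.
spread = toy.groupby("Cluster").agg(["mean", "std"])
print(spread)
```

The same `groupby(...).agg(["mean", "std"])` call on `dfkm1` would back each bullet with a number.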

K-Means Clustering on data2¶

As a reminder: data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful, but it will not suffer from the curse of dimensionality at all.

In [43]:
# make a copy of data2 so we don't change it (the original data frames are reused for hierarchical clustering later)
k_means_df2 = data2.copy()

Elbow Score for data2¶

In [44]:
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(k_means_df2)
    prediction = model.predict(k_means_df2)
    distortion = (
        sum(np.min(cdist(k_means_df2, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df2.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.3862098789299604
Number of Clusters: 3 	Average Distortion: 2.33620927590848
Number of Clusters: 4 	Average Distortion: 2.219050563833442
Number of Clusters: 5 	Average Distortion: 2.133404401901685
Number of Clusters: 6 	Average Distortion: 2.081503686093715
Number of Clusters: 7 	Average Distortion: 2.0045413402786814
Number of Clusters: 8 	Average Distortion: 1.9864237824874411
Number of Clusters: 9 	Average Distortion: 1.956222103389025
Number of Clusters: 10 	Average Distortion: 1.9360473996664198
Number of Clusters: 11 	Average Distortion: 1.8615942883461607
Number of Clusters: 12 	Average Distortion: 1.8219574388532505
Number of Clusters: 13 	Average Distortion: 1.7936924742607907
Number of Clusters: 14 	Average Distortion: 1.7567842179093438
In [45]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df2)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
Out[45]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

Silhouette Scores for data2¶

In [46]:
# silhouette scores
silhouette_scores = []
cluster_list = range(2, 15)
for k in cluster_list:
    kmeans = KMeans(n_clusters=k, random_state=1)
    predictions = kmeans.fit_predict(k_means_df2)
    score = silhouette_score(k_means_df2, predictions)
    silhouette_scores.append(score)
    print("Number of Clusters:", k, "\tAverage Silhouette Score:", score)

plt.plot(cluster_list, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Average Silhouette Score")
plt.title("Selecting k with the Silhouette Method", fontsize=20)
plt.show()
Number of Clusters: 2 	Average Silhouette Score: 0.45335782729503565
Number of Clusters: 3 	Average Silhouette Score: 0.40374060030338865
Number of Clusters: 4 	Average Silhouette Score: 0.4246430808437099
Number of Clusters: 5 	Average Silhouette Score: 0.4381539778147092
Number of Clusters: 6 	Average Silhouette Score: 0.40869599703024256
Number of Clusters: 7 	Average Silhouette Score: 0.1207450219233897
Number of Clusters: 8 	Average Silhouette Score: 0.3693991650696542
Number of Clusters: 9 	Average Silhouette Score: 0.35185096182499204
Number of Clusters: 10 	Average Silhouette Score: 0.32950073703610283
Number of Clusters: 11 	Average Silhouette Score: 0.1486586842527321
Number of Clusters: 12 	Average Silhouette Score: 0.15784241071085106
Number of Clusters: 13 	Average Silhouette Score: 0.15646997458716602
Number of Clusters: 14 	Average Silhouette Score: 0.16253506827999134
In [47]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2,15), metric="silhouette", timings=True)
visualizer.fit(k_means_df2)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
Out[47]:
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>

Between the silhouette scores and the elbow method, cluster counts of 2, 5, 6, and 8 seem like they're worth exploring.

In [48]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=2, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
Out[48]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [49]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
Out[49]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [50]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
Out[50]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [51]:
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=8, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
Out[51]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 8 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Once again, the highest silhouette scores aren't very high, and the model is putting the majority of data points into one cluster. There is little useful information to be gained from this model or from K-means on data1. I am going with 5 clusters because it has the second-highest silhouette score while still explaining a decent amount of the in-cluster variance.
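The claim that most points fall into one cluster can be quantified directly from the label vector. A minimal sketch on synthetic data (an assumption; in the notebook, `kmeans.labels_` from the fit on `k_means_df2` would be used):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 340-row scaled frame k_means_df2.
X, _ = make_blobs(n_samples=340, centers=3, cluster_std=3.0, random_state=1)
labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)

# Share of samples captured by the largest cluster; values near 1.0 mean one
# dominant cluster is absorbing most of the data.
counts = np.bincount(labels, minlength=5)
dominance = counts.max() / counts.sum()
print(counts, round(dominance, 3))
```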

Create Final Model for data2¶

In [52]:
# final kmeans model for data2
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(k_means_df2)
Out[52]:
KMeans(n_clusters=5, random_state=1)
In [53]:
# create a fresh copy of the original data frame
dfkm2 = data.copy()  # name: data frame, kmeans, data2

# add the kmeans cluster labels to this unscaled copy and to the scaled data frame k_means_df2
dfkm2['Cluster'] = kmeans.labels_
k_means_df2['Cluster'] = kmeans.labels_

Cluster Profiling for data2¶

In [54]:
# cluster profiling
km_cluster_profile_2 = dfkm2.groupby('Cluster').mean(numeric_only=True)
In [55]:
km_cluster_profile_2["Count of Companies in this Cluster"] = (dfkm2.groupby('Cluster')["Security"].count().values)
km_cluster_profile_2.head()
Out[55]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster
0 72.738269 5.179897 1.380738 34.825455 53.138182 -1.014729e+07 1.488642e+09 3.636164 4.379616e+08 23.680917 -3.395254 275
1 65.106668 -11.888125 2.722141 44.000000 61.400000 -3.685830e+07 -2.137169e+09 -5.560333 5.297142e+08 113.488924 0.905486 30
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1.292500e+09 -1.910650e+10 -41.815000 5.195740e+08 60.748608 1.565141 2
3 46.672222 5.166566 1.079367 25.000000 58.333333 -3.040667e+09 1.484844e+10 3.435556 4.564960e+09 15.596051 -6.354193 9
4 211.164720 12.456786 1.699388 30.708333 280.250000 2.197085e+09 2.808601e+09 6.818333 7.389574e+08 37.895420 15.682619 24
In [56]:
# print which companies are in each cluster
for cluster in dfkm2['Cluster'].unique():
    print(f'Cluster {cluster}')
    print(dfkm2[dfkm2['Cluster'] == cluster]['Security'].unique())
    print('')
Cluster 0
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb'
 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.'
 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.'
 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group'
 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications'
 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation'
 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies'
 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions'
 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources'
 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.'
 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.'
 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'General Dynamics'
 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors'
 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.'
 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group'
 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings'
 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.'
 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal'
 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International'
 'Roper Industries' 'Republic Services Inc' 'SCANA Corp'
 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.'
 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc'
 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']

Cluster 1
['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc'
 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation'
 'Freeport-McMoran Cp & Gld' 'Hess Corporation'
 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.'
 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.'
 'Williams Cos.' 'Cimarex Energy']

Cluster 4
['Alliance Data Systems' 'Amgen Inc' 'Bank of America Corp'
 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix'
 'Edwards Lifesciences' 'Facebook' 'First Solar Inc'
 'Frontier Communications' 'Gilead Sciences' 'Halliburton Co.'
 'Intel Corp.' 'Intuitive Surgical Inc.' "McDonald's Corp."
 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor'
 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd'
 'Yahoo Inc.']

Cluster 2
['Apache Corporation' 'Chesapeake Energy']

Cluster 3
['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company'
 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo'
 'Exxon Mobil Corp.']

In [57]:
dfkm2.groupby(["Cluster", "GICS Sector"])['Security'].count()
Out[57]:
Security
Cluster GICS Sector
0 Consumer Discretionary 33
Consumer Staples 17
Energy 5
Financials 45
Health Care 29
Industrials 52
Information Technology 24
Materials 18
Real Estate 26
Telecommunications Services 2
Utilities 24
1 Consumer Discretionary 1
Energy 21
Health Care 1
Industrials 1
Information Technology 4
Materials 2
2 Energy 2
3 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Telecommunications Services 2
4 Consumer Discretionary 5
Consumer Staples 1
Energy 1
Financials 1
Health Care 9
Information Technology 5
Real Estate 1
Telecommunications Services 1

In [58]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = dfkm2.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=dfkm2, x="Cluster", y=variable)

plt.tight_layout(pad=2.0)

Insights for KMeans for data2¶

  • Clusters 2 and 3 are the smallest clusters, so we would expect higher variance in each boxplot. This holds for a few variables but not all of them, which suggests the companies in clusters 2 and 3 are highly similar to one another. The SilhouetteVisualizer for 5 clusters supports this: both clusters show almost exclusively positive silhouette coefficients.
  • Cluster 0 seems to have caught most of the outliers this time. Despite this, it does not have a high interquartile range for any variable, suggesting that apart from the outliers, companies in cluster 0 are very similar to each other.
  • Cluster 0 has neither the highest nor the lowest average in almost any variable. Because about 80% of the data points ended up in this cluster, this is to be expected.
  • Cluster 1 had the highest P/E Ratio on average. It also had the second-highest average Volatility and the second-lowest average Price Change.
  • Cluster 2 had the highest ROE by far as well as the highest Volatility, and the lowest Net Income and Earnings Per Share. This is the smallest cluster, with only two Energy companies; they are likely two of the three companies in our smallest cluster from K-means on data1.
  • Cluster 3 had the highest Net Income and the highest Estimated Shares Outstanding; because that net income was divided over so many shares, it makes sense that the ROE was approximately the same as in the other clusters.
  • Cluster 4 had the highest average Price Change, the highest average Current Price, the highest average Cash Ratio, and the highest average P/B Ratio.
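The conjecture about cluster 2 can be checked with a simple set intersection on the printed memberships, and overall agreement between the two partitions can be scored with `adjusted_rand_score`. A minimal sketch; the short label vectors are hypothetical stand-ins for the full `dfkm1['Cluster']` and `dfkm2['Cluster']` columns:

```python
from sklearn.metrics import adjusted_rand_score

# Smallest clusters as printed above: data1's cluster 2 and data2's cluster 2.
small_data1 = {"Apache Corporation", "Chesapeake Energy", "Devon Energy Corp."}
small_data2 = {"Apache Corporation", "Chesapeake Energy"}
print(small_data2 <= small_data1)  # → True: both companies reappear

# Toy label vectors standing in for the two full Cluster columns; ARI is 1.0
# for identical partitions (up to relabelling) and near 0 for unrelated ones.
labels_data1 = [0, 0, 1, 1, 2, 2, 2]
labels_data2 = [1, 1, 0, 0, 2, 2, 2]  # same partition, labels permuted
print(adjusted_rand_score(labels_data1, labels_data2))  # → 1.0
```

ARI is permutation-invariant, which matters here because K-means assigns cluster numbers arbitrarily between runs.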

Hierarchical Clustering on data1¶

As a reminder: data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be the curse of dimensionality, as we are increasing the number of columns from 13 to 124 even with drop_first=True.
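The 13 → 124 column growth comes from one-hot encoding: each categorical column with k levels contributes k-1 indicator columns under drop_first=True, and GICS Sub Industry has many levels. A minimal sketch with a toy frame (the notebook's data would be encoded the same way):

```python
import pandas as pd

# Toy frame: one numeric column plus a categorical with three levels.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "sector": ["Energy", "Energy", "Utilities", "Financials"],
})

# drop_first=True keeps k-1 indicators per categorical (3 levels -> 2 columns),
# so a high-cardinality column like GICS Sub Industry inflates width quickly.
encoded = pd.get_dummies(toy, columns=["sector"], drop_first=True)
print(list(encoded.columns))  # → ['x', 'sector_Financials', 'sector_Utilities']
```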

In [59]:
hc_df1 = data1.copy()

Cophenetic Correlations for data1¶

In [60]:
# distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df1, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df1))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

print(
    "The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9192696289462164.
Cophenetic correlation for Euclidean distance and complete linkage is 0.5839761524776856.
Cophenetic correlation for Euclidean distance and average linkage is 0.9387918818713881.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.9017395996624018.
Cophenetic correlation for Chebyshev distance and single linkage is 0.8733698250376598.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.7629621936822496.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9194190244129151.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8866543829752598.
Cophenetic correlation for Mahalanobis distance and single linkage is -0.018111220588679713.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.10942269236725295.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.07162884851014278.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.05628335494200424.
Cophenetic correlation for Cityblock distance and single linkage is 0.9110913485411501.
Cophenetic correlation for Cityblock distance and complete linkage is 0.6986069960985519.
Cophenetic correlation for Cityblock distance and average linkage is 0.891363494361917.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.6244033214763337.
The highest cophenetic correlation is 0.9387918818713881 for Euclidean distance and average linkage.

It looks like Euclidean distance and Average linkage are the best options for data1.
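As a refresher on what these numbers measure: the cophenetic correlation is the Pearson correlation between the original pairwise distances and the distances implied by the dendrogram, so values near 1 mean the tree preserves the data's geometry well. A minimal sketch on random stand-in data (not the project data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # stand-in feature matrix

# Build the tree, then correlate tree-implied distances with pdist(X)
Z = linkage(X, metric="euclidean", method="average")
c, coph_dists = cophenet(Z, pdist(X))
print(round(c, 3))  # Pearson correlation; closer to 1 is a better fit
```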

In [61]:
# let's check linkage types for Euclidean only
linkage_types = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_types:
    Z = linkage(hc_df1, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df1))
    print(
        "Cophenetic correlation for Euclidean distance and {} linkage is {}.".format(
            lm, c
        )
    )
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "Euclidean"
        high_dm_lm[1] = lm

# let's print the linkage method that works best for Euclidean
print(
    "The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9192696289462164.
Cophenetic correlation for Euclidean distance and complete linkage is 0.5839761524776856.
Cophenetic correlation for Euclidean distance and average linkage is 0.9387918818713881.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.9017395996624018.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.946013261636426.
Cophenetic correlation for Euclidean distance and median linkage is 0.914449553156567.
Cophenetic correlation for Euclidean distance and ward linkage is 0.6910854135257704.
The highest cophenetic correlation is 0.946013261636426 for Euclidean distance and centroid linkage.

This improved the cophenetic correlation by about 0.01 (from 0.939 for average linkage to 0.946 for centroid linkage).

Checking Dendrograms for data1¶

In [62]:
# linkage types
linkage_types = ["single", "complete", "average", "centroid", "ward", "weighted", "median"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_types), 1, figsize=(15, 30))

# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_types):
    Z = linkage(hc_df1, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df1))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
In [63]:
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc1 = pd.DataFrame(compare, columns=compare_cols)
df_cc1 = df_cc1.sort_values(by="Cophenetic Coefficient")
df_cc1
Out[63]:
Linkage Cophenetic Coefficient
1 complete 0.583976
4 ward 0.691085
5 weighted 0.901740
6 median 0.914450
0 single 0.919270
2 average 0.938792
3 centroid 0.946013

Creating model using sklearn for data1¶

In [64]:
# create a model using 5 clusters and Ward linkage
HCmodel1 = AgglomerativeClustering(n_clusters=5, linkage="ward")
HCmodel1.fit(hc_df1)

# adding the cluster labels to a copy of the original data frame as well as the scaled data frame
dfhc1 = data.copy()
hc_df1["Cluster"] = HCmodel1.labels_ # the scaled data
dfhc1["Cluster"] = HCmodel1.labels_ # the non-scaled data

Note: Even though centroid linkage had the highest cophenetic coefficient, 'centroid' is not a valid value for the linkage parameter of sklearn's AgglomerativeClustering. We will use Ward linkage instead: despite its somewhat lower cophenetic coefficient, it divides the data into the most varied clusters, whereas every other method puts more than 90% of the data in one cluster unless there are dozens of clusters.
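If we did want centroid-linkage labels, one workaround (sketched here on random stand-in data, not hc_df1) is to build the tree with scipy and cut it with fcluster, since scipy supports centroid linkage even though sklearn does not:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))  # stand-in for the scaled feature matrix

# scipy's linkage accepts method="centroid"; fcluster then cuts the
# tree into at most 5 flat clusters (labels are 1-indexed).
Z = linkage(X, metric="euclidean", method="centroid")
labels = fcluster(Z, t=5, criterion="maxclust")
print(sorted(set(labels)))
```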

Cluster Profiling for data1¶

In [65]:
hc_cluster_profile_1 = dfhc1.groupby("Cluster").mean(numeric_only=True)
hc_cluster_profile_1["Count of Companies in this Cluster"] = (
    dfhc1.groupby("Cluster")["Security"].count().values)
In [66]:
# let's examine which companies are in each cluster
for cluster in dfhc1['Cluster'].unique():
    print(f'Cluster {cluster}')
    print(dfhc1[dfhc1['Cluster'] == cluster]['Security'].unique())
    print('')
Cluster 3
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp'
 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial'
 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy'
 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial'
 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc'
 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health'
 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)'
 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics'
 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum'
 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient'
 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)'
 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.'
 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.'
 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd'
 'Robert Half International' 'Roper Industries' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy'
 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions'
 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company'
 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor'
 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods'
 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments'
 'Under Armour' 'United Continental Holdings' 'UDR Inc'
 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group'
 'Union Pacific' 'United Parcel Service' 'United Technologies'
 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials'
 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc'
 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.'
 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide'
 'Wynn Resorts Ltd' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona'
 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings'
 'Zions Bancorp' 'Zoetis']

Cluster 0
['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc'
 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation'
 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation'
 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Marathon Oil Corp.'
 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co'
 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum'
 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy'
 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy']

Cluster 1
['Alliance Data Systems' 'Amgen Inc' 'Celgene Corp.'
 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook'
 'Frontier Communications' 'Intuitive Surgical Inc.' "McDonald's Corp."
 'Monster Beverage' 'Priceline.com Inc' 'Regeneron'
 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Yahoo Inc.']

Cluster 4
['Allegion' 'Apache Corporation' 'Chesapeake Energy'
 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark'
 'S&P Global, Inc.']

Cluster 2
['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

There is a moderate imbalance between cluster sizes, and the cophenetic coefficient for the Ward dendrogram was only 0.69; however, every other hierarchical clustering method had a single cluster containing more than 90% of the data points. This may be due to:

  • The curse of dimensionality
  • Outliers strongly affecting the clustering
  • Too few data points
  • The data may be varied enough that there are no good clusters
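A quick way to quantify this imbalance is to tabulate the cluster sizes. The sketch below hard-codes the counts from the profiling table above rather than recomputing them from HCmodel1.labels_:

```python
import numpy as np

# Hypothetical label vector reproducing the cluster counts above:
# cluster 3 has 277 companies, 0 has 29, 1 has 16, 4 has 7, 2 has 11.
labels = np.array([3] * 277 + [0] * 29 + [1] * 16 + [4] * 7 + [2] * 11)

sizes = np.bincount(labels)
shares = sizes / sizes.sum()
for k, (n, s) in enumerate(zip(sizes, shares)):
    print(f"Cluster {k}: {n} companies ({s:.1%})")
```

Cluster 3 alone holds over 80% of the companies, which is the imbalance the bullets above try to explain.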
In [67]:
# let's display the hc_cluster_profile_1
hc_cluster_profile_1.style.highlight_max(color="lightgreen", axis=0)
Out[67]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster                        
0 66.875863 -11.513243 2.684321 44.965517 67.517241 270355000.000000 -2279596724.137931 -5.930345 557924857.946207 119.832148 1.185239 29
1 272.166454 9.664338 1.587885 23.375000 337.187500 547402937.500000 1135918937.500000 7.585000 578873966.226250 40.700126 19.675682 16
2 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 11
3 72.697884 5.306429 1.409486 25.631769 55.718412 425447.653430 1551832010.830325 3.729477 437878516.249134 23.459500 -2.881193 277
4 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
In [68]:
# let's see the minimums as well
hc_cluster_profile_1.style.highlight_min(color="red", axis=0)
Out[68]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster                        
0 66.875863 -11.513243 2.684321 44.965517 67.517241 270355000.000000 -2279596724.137931 -5.930345 557924857.946207 119.832148 1.185239 29
1 272.166454 9.664338 1.587885 23.375000 337.187500 547402937.500000 1135918937.500000 7.585000 578873966.226250 40.700126 19.675682 16
2 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 11
3 72.697884 5.306429 1.409486 25.631769 55.718412 425447.653430 1551832010.830325 3.729477 437878516.249134 23.459500 -2.881193 277
4 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
In [69]:
plt.figure(figsize=(20, 35))
plt.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)

# create a list of the numerical columns
num_cols = dfhc1.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_cols):
    plt.subplot(5, 3, i + 1)
    sns.boxplot(data=dfhc1, x="Cluster", y=variable)

plt.tight_layout(pad=2.0)

Insights for HC for data1¶

  • Cluster 0 had the highest average Volatility, the highest average P/E Ratio, and the most negative average Price Change
  • Cluster 1 was one of the best performing groups. It had the highest average Current Price, the most positive average Price Change, the highest average Cash Ratio, the highest average P/B Ratio, and the highest average Earnings Per Share. It did not have the lowest average in any category.
  • Cluster 2 had the highest average Net Cash Flow, Net Income, and Estimated Outstanding Shares. They also had the lowest average Current Price, the least average Volatility, the least average ROE, and the least average P/E Ratio.
  • Cluster 3 had the most data points. As such, it did not perform the best or worst in any category as a cluster. This is to be expected, since the weight of the extra data points prevented outliers from affecting the averages as much.
  • Cluster 4 was the smallest cluster and had the highest average ROE. It also had the lowest average in several categories: Cash Ratio, Net Cash Flow, Net Income, Earnings Per Share, Estimated Shares Outstanding, and P/B Ratio.

Hierarchical Clustering for data2¶

As a reminder: data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful, but it will not suffer from the curse of dimensionality at all.

In [70]:
hc_df2 = data2.copy()

Cophenetic Correlations for data2¶

In [71]:
# distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df2, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df2))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

print(
    "The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
The highest cophenetic correlation is 0.9422540609560814 for Euclidean distance and average linkage.

It appears the best combination is Euclidean distance with Average linkage.

In [72]:
# let's check linkage types for Euclidean only
linkage_types = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_types:
    Z = linkage(hc_df2, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df2))
    print(
        "Cophenetic correlation for Euclidean distance and {} linkage is {}.".format(
            lm, c
        )
    )
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "Euclidean"
        high_dm_lm[1] = lm

# let's print the linkage method that works best for Euclidean
print(
    "The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.9314012446828154.
Cophenetic correlation for Euclidean distance and median linkage is 0.9198690668829905.
Cophenetic correlation for Euclidean distance and ward linkage is 0.7101180299865353.
The highest cophenetic correlation is 0.9422540609560814 for Euclidean distance and average linkage.

Checking the additional linkage methods gave the same result: average linkage remains the best.

Checking Dendrograms for data2¶

In [73]:
# linkage types
linkage_types = ["single", "complete", "average", "centroid", "ward", "weighted", "median"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_types), 1, figsize=(15, 30))

# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_types):
    Z = linkage(hc_df2, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df2))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
In [74]:
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc2 = pd.DataFrame(compare, columns=compare_cols)
df_cc2 = df_cc2.sort_values(by="Cophenetic Coefficient")
df_cc2
Out[74]:
Linkage Cophenetic Coefficient
4 ward 0.710118
1 complete 0.787328
5 weighted 0.869378
6 median 0.919869
0 single 0.923227
3 centroid 0.931401
2 average 0.942254

It appears again that, despite dropping the dummy variables for GICS Sector and GICS Sub Industry, every clustering method except Ward puts more than 90% of the data into one cluster. However, the cophenetic coefficients are higher than for data1, so hierarchical clustering fits data2 slightly better. We will use Ward again because it gives the best spread of cluster sizes, despite a cophenetic coefficient of only 0.71.
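The "more than 90% in one cluster" behavior is easy to reproduce: with one dense blob plus a few outliers, average linkage keeps almost everything together until the tree is cut many times. A sketch on synthetic data (not the project data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# 95 points in one blob plus 5 far-away points mimics the pattern
# where outliers dominate the top-level splits.
X = np.vstack([rng.normal(0, 1, size=(95, 4)),
               rng.normal(20, 1, size=(5, 4))])

Z = linkage(X, metric="euclidean", method="average")
for k in (2, 5, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    share = np.bincount(labels)[1:].max() / len(labels)
    print(f"k={k}: largest cluster holds {share:.0%} of the points")
```

The first several cuts only peel off outliers, so the largest cluster's share shrinks slowly as k grows.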

Creating a model using sklearn for data2¶

In [75]:
# create a model using 5 clusters and Ward linkage
HCmodel2 = AgglomerativeClustering(n_clusters=5, linkage="ward")
HCmodel2.fit(hc_df2)

# adding the cluster labels to a copy of the original data frame as well as the scaled data frame
dfhc2 = data.copy()
hc_df2["Cluster"] = HCmodel2.labels_ # the scaled data
dfhc2["Cluster"] = HCmodel2.labels_ # the non-scaled data

Cluster Profiling for data2¶

In [76]:
hc_cluster_profile_2 = dfhc2.groupby("Cluster").mean(numeric_only=True)
hc_cluster_profile_2["Count of Companies in this Cluster"] = (
    dfhc2.groupby("Cluster")["Security"].count().values)
In [77]:
# let's examine which companies are in each cluster
for cluster in dfhc2['Cluster'].unique():
    print(f'Cluster {cluster}')
    print(dfhc2[dfhc2['Cluster'] == cluster]['Security'].unique())
    print('')
Cluster 3
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co'
 'Ameren Corp' 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp'
 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group'
 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.'
 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight'
 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial'
 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy'
 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial'
 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc'
 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health'
 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)'
 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics'
 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.'
 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences'
 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage'
 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.'
 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.'
 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc."
 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp"
 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries'
 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials'
 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company'
 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo'
 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy'
 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings'
 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.'
 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.'
 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd'
 'Robert Half International' 'Roper Industries' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy'
 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions'
 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company'
 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor'
 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods'
 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments'
 'Under Armour' 'United Continental Holdings' 'UDR Inc'
 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group'
 'Union Pacific' 'United Parcel Service' 'United Technologies'
 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials'
 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.'
 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd'
 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.'
 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp'
 'Zoetis']

Cluster 0
['Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc'
 'Amazon.com Inc' 'Chipotle Mexican Grill' 'Equinix' 'Facebook'
 'Frontier Communications' 'Intuitive Surgical Inc.' 'Monster Beverage'
 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'Waters Corporation'
 'Yahoo Inc.']

Cluster 1
['Allegion' 'Apache Corporation' 'Chesapeake Energy'
 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark'
 'S&P Global, Inc.']

Cluster 4
['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources'
 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kinder Morgan'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.'
 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy']

Cluster 2
['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

There is a moderate imbalance between cluster sizes, and the cophenetic coefficient for the Ward dendrogram was only 0.71; however, every other hierarchical clustering method had a single cluster containing more than 90% of the data points. This may be due to:

  • Outliers strongly affecting the clustering
  • Too few data points
  • The data may be varied enough that there are no good clusters

Unlike data1, we know this is not due to the curse of dimensionality, because we did not create dummy variables for GICS Sector or GICS Sub Industry in data2.

In [78]:
# let's display the hc_cluster_profile_2
hc_cluster_profile_2.style.highlight_max(color="lightgreen", axis=0)
Out[78]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster                        
0 326.198218 10.563242 1.642560 14.400000 309.466667 288850666.666667 864498533.333333 7.785333 544900261.301333 113.095334 19.142151 15
1 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
2 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 11
3 72.760400 5.213307 1.427078 25.603509 60.392982 79951512.280702 1538594322.807018 3.655351 446472132.228456 24.722670 -2.647194 285
4 36.440455 -16.073408 2.832884 57.500000 42.409091 -472834090.909091 -3161045227.272727 -8.005000 514367806.201818 85.555682 0.836839 22
In [79]:
# let's see the minimums as well
hc_cluster_profile_2.style.highlight_min(color="red", axis=0)
Out[79]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio Count of Companies in this Cluster
Cluster                        
0 326.198218 10.563242 1.642560 14.400000 309.466667 288850666.666667 864498533.333333 7.785333 544900261.301333 113.095334 19.142151 15
1 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
2 42.848182 6.270446 1.123547 22.727273 71.454545 558636363.636364 14631272727.272728 3.410000 4242572567.290909 15.242169 -4.924615 11
3 72.760400 5.213307 1.427078 25.603509 60.392982 79951512.280702 1538594322.807018 3.655351 446472132.228456 24.722670 -2.647194 285
4 36.440455 -16.073408 2.832884 57.500000 42.409091 -472834090.909091 -3161045227.272727 -8.005000 514367806.201818 85.555682 0.836839 22
In [80]:
plt.figure(figsize=(20, 35))
plt.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)

# create a list of the numerical columns
num_cols = dfhc2.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_cols):
    plt.subplot(5, 3, i + 1)
    sns.boxplot(data=dfhc2, x="Cluster", y=variable)

plt.tight_layout(pad=2.0)

Insights for HC for data2¶

  • Cluster 0 had the highest averages for Current Price, (positive) Price Change, Cash Ratio, Earnings Per Share, P/E Ratio, and P/B Ratio. It also had the lowest average ROE.
  • Cluster 1, the smallest cluster, had the highest average ROE. It also had the lowest averages in many areas: Cash Ratio, Net Cash Flow, Net Income, Earnings Per Share, Estimated Shares Outstanding, and P/B Ratio.
  • Cluster 2 had the highest averages for Net Cash Flow, Net Income, and Estimated Shares Outstanding, along with the lowest Volatility and the lowest P/E Ratio.
  • Cluster 3 held the most data points, close to 83% of the total. Because of its size, it was less affected by outliers and did not have the highest or lowest average in any category.
  • Cluster 4 had the highest average Volatility, the lowest average Current Price, and the most negative average Price Change.
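
For reference, a cluster profile table like hc_cluster_profile_2 can be sketched as a groupby over the cluster labels with a size column appended; the numbers below are toy values, not the real stock data:

```python
import pandas as pd

# Toy stand-in for dfhc2: two clusters with a couple of numeric features
df = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "ROE": [10.0, 20.0, 600.0, 650.0, 655.0],
    "Cash Ratio": [300, 320, 30, 35, 36],
})

# Per-cluster means, plus the count of companies in each cluster
profile = df.groupby("Cluster").mean()
profile["Count of Companies in this Cluster"] = df.groupby("Cluster").size()
print(profile)
```

The `.style.highlight_max` / `.style.highlight_min` calls used above then make the extreme averages easy to spot per column.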

K-means vs Hierarchical Clustering¶

  • Both forms of clustering had low computation time, likely due to the small number of data points; K-means was slightly faster.
  • Hierarchical Clustering using Ward linkage gave the most distinct clusters, at the cost of selecting the dendrogram with the lowest cophenetic coefficient (approx. 0.7). That said, the K-means clustering had extremely poor silhouette scores (approx. 0.35) and still put more than 90% of the data points into a single cluster.
  • Both clustering methods created similar clusters. In each of the four models, the same pattern occurred: one cluster held most of the data points and neither overperformed nor underperformed in any single category, while one small cluster contained a handful of energy companies with extremely high ROE.
  • Across the four models, I obtained 4 clusters once and 5 clusters three times, which suggests that 5 is the best number of clusters.
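
The timing and silhouette comparison above can be sketched as follows, using synthetic blobs in place of the scaled stock features (KMeans and AgglomerativeClustering from scikit-learn, which is assumed to be the library used in this notebook):

```python
import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled stock features: 5 well-separated blobs,
# roughly the same number of points as companies in the data
X, _ = make_blobs(
    n_samples=340,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8], [4, 16]],
    cluster_std=1.0,
    random_state=1,
)

# Time K-means
t0 = time.perf_counter()
km_labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)
km_time = time.perf_counter() - t0

# Time hierarchical clustering with Ward linkage
t0 = time.perf_counter()
hc_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)
hc_time = time.perf_counter() - t0

print(f"KMeans : {km_time:.4f}s, silhouette={silhouette_score(X, km_labels):.2f}")
print(f"Ward HC: {hc_time:.4f}s, silhouette={silhouette_score(X, hc_labels):.2f}")
```

On data this small, both methods finish in a fraction of a second, which matches the observation that computation time was not a deciding factor between them.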

Actionable Insights and Recommendations¶

If I were the one investing, I would avoid the clusters with high variability in favor of safer investments. Because I built four models and, in each one, the largest cluster was the least affected by outliers, I would be inclined to pick the companies that appeared in the largest cluster across all four models. This protects the portfolio from companies that were clustered together because of poor performance. While some of the small clusters performed well in one area or another, every small cluster also underperformed in some area. The code below identifies which companies repeatedly appeared in the largest, safest cluster.

In [82]:
# Get the companies that meet the criteria from each data frame
securities_km1 = set(dfkm1[dfkm1['Cluster'] == 1]['Security']) # for each model I went back and checked which cluster was largest
securities_km2 = set(dfkm2[dfkm2['Cluster'] == 0]['Security'])
securities_hc1 = set(dfhc1[dfhc1['Cluster'] == 3]['Security'])
securities_hc2 = set(dfhc2[dfhc2['Cluster'] == 3]['Security'])

# Find the intersection of the sets
result_securities = securities_km1.intersection(securities_km2, securities_hc1, securities_hc2)

# Convert the result to a list (if needed)
result_securities_list = list(result_securities)

# Display the result
print(result_securities_list)

# Check how many securities are in result_securities_list
print(len(result_securities_list))
['Stryker Corp.', 'Vornado Realty Trust', 'Capital One Financial', 'Mylan N.V.', 'Fiserv Inc', "Honeywell Int'l Inc.", 'Western Union Co', 'Zoetis', '3M Company', 'Mattel Inc.', 'Aon plc', 'Lockheed Martin Corp.', 'FirstEnergy Corp', 'CME Group Inc.', 'Procter & Gamble', 'Illinois Tool Works', 'Ameriprise Financial', 'HCA Holdings', 'Lennar Corp.', 'Boston Scientific', 'Juniper Networks', 'American Water Works Company Inc', "Marriott Int'l.", 'Xcel Energy Inc', 'Zions Bancorp', 'L-3 Communications Holdings', 'Adobe Systems Inc', 'Anthem Inc.', 'Fastenal Co', 'Valero Energy', 'Arthur J. Gallagher & Co.', 'Unum Group', 'Amphenol Corp', 'Consolidated Edison', 'American Express Co', 'NASDAQ OMX Group', 'Dr Pepper Snapple Group', 'Bristol-Myers Squibb', 'Iron Mountain Incorporated', 'AbbVie', 'Realty Income Corporation', 'Southwest Airlines', 'Ball Corp', 'Molson Coors Brewing Company', 'Citrix Systems', 'SCANA Corp', 'Humana Inc.', "People's United Financial", 'Torchmark Corp.', 'Carnival Corp.', 'Allstate Corp', "Expeditors Int'l", 'Alaska Air Group Inc', 'American Electric Power', 'Marsh & McLennan', 'Leucadia National Corp.', 'Eversource Energy', 'Eaton Corporation', 'Pulte Homes Inc.', 'Garmin Ltd.', 'Entergy Corp.', 'Goodyear Tire & Rubber', 'PPG Industries', 'Varian Medical Systems', 'Baxter International Inc.', 'FLIR Systems', 'United Health Group Inc.', 'Southern Co.', "Edison Int'l", 'AFLAC Inc', 'Omnicom Group', 'Tractor Supply Company', 'Principal Financial Group', 'Martin Marietta Materials', 'UDR Inc', 'Du Pont (E.I.)', 'Mettler Toledo', 'Lilly (Eli) & Co.', 'United Continental Holdings', 'Synchrony Financial', 'Vulcan Materials', 'Expedia Inc.', 'Ameren Corp', 'Universal Health Services, Inc.', 'Exelon Corp.', 'Intl Flavors & Fragrances', 'Roper Industries', 'DaVita Inc.', 'Mondelez International', 'Republic Services Inc', 'J. B. 
Hunt Transport Services', 'CSX Corp.', 'CenterPoint Energy', 'Xylem Inc.', 'Delphi Automotive', 'PACCAR Inc.', 'Progressive Corp.', 'Masco Corp.', 'Mohawk Industries', 'IDEXX Laboratories', 'eBay Inc.', 'Mastercard Inc.', 'CMS Energy', 'Mead Johnson', 'Wyndham Worldwide', 'Applied Materials Inc', 'CF Industries Holdings Inc', 'SunTrust Banks', 'Alliant Energy Corp', 'Duke Energy', 'Welltower Inc.', 'Harley-Davidson', 'Waste Management Inc.', 'Cognizant Technology Solutions', 'State Street Corp.', 'Centene Corporation', 'Host Hotels & Resorts', 'Dominion Resources', 'Charles Schwab Corporation', 'International Paper', 'United Technologies', 'Simon Property Group Inc', 'Huntington Bancshares', 'Deere & Co.', 'Verisign Inc.', 'Altria Group Inc', 'NextEra Energy', 'Equity Residential', 'American Tower Corp A', 'Archer-Daniels-Midland Co', 'AMETEK Inc', 'Albemarle Corp', 'General Dynamics', 'Hasbro Inc.', 'McCormick & Co.', 'PNC Financial Services', 'The Hershey Company', 'Fortune Brands Home & Security', 'United Parcel Service', 'Praxair Inc.', 'E*Trade', 'HP Inc.', 'Flowserve Corporation', 'Kimco Realty', 'Apartment Investment & Mgmt', 'Church & Dwight', 'Ventas Inc', "Moody's Corp", 'Hartford Financial Svc.Gp.', 'American Airlines Group', 'The Walt Disney Company', 'The Bank of New York Mellon Corp.', 'Navient', 'Affiliated Managers Group Inc', 'Dentsply Sirona', 'Discover Financial Services', 'Genuine Parts', 'Macerich', 'Pitney-Bowes', 'BorgWarner', 'Philip Morris International', 'Scripps Networks Interactive Inc.', 'SL Green Realty', 'Nielsen Holdings', 'Public Serv. 
Enterprise Inc.', 'Sempra Energy', 'Quest Diagnostics', 'General Growth Properties Inc.', 'Stericycle Inc', 'Cincinnati Financial', 'Federal Realty Investment Trust', 'Interpublic Group', 'Tegna, Inc.', 'Northern Trust Corp.', 'MetLife Inc.', 'Total System Services', 'Newell Brands', 'Phillips 66', 'XL Capital', 'Leggett & Platt', 'Boeing Company', 'Comerica Inc.', 'Eastman Chemical', 'AvalonBay Communities, Inc.', 'Grainger (W.W.) Inc.', 'Verisk Analytics', 'Boston Properties', 'Assurant Inc', 'Under Armour', 'Whirlpool Corp.', 'Activision Blizzard', 'Norfolk Southern Corp.', 'Cummins Inc.', 'Level 3 Communications', 'Crown Castle International Corp.', 'LyondellBasell', 'Marathon Petroleum', 'CBRE Group', 'PayPal', 'General Motors', 'Essex Property Trust, Inc.', 'PG&E Corp.', 'Pentair Ltd.', 'PepsiCo Inc.', 'Delta Air Lines', 'Pinnacle West Capital', 'Robert Half International', 'HCP Inc.', 'Nucor Corp.', 'Fidelity National Information Services', 'Ryder System', 'Zimmer Biomet Holdings', 'C. H. Robinson Worldwide', 'American International Group, Inc.', 'Laboratory Corp. of America Holding', 'Corning Inc.', 'Caterpillar Inc.', 'Discovery Communications-C', 'Dun & Bradstreet', 'Union Pacific', 'Thermo Fisher Scientific', 'CenturyLink Inc', 'Royal Caribbean Cruises Ltd', 'Digital Realty Trust', 'Merck & Co.', 'Invesco Ltd.', 'Texas Instruments', 'FMC Corporation', "O'Reilly Automotive", 'Dover Corp.', 'Weyerhaeuser Corp.', 'Equifax Inc.', 'Henry Schein', 'Chevron Corp.', 'Xerox Corp.', 'Bard (C.R.) Inc.', 'Sealed Air', 'Kansas City Southern', 'Prudential Financial', 'Jacobs Engineering Group', 'M&T Bank Corp.', 'BB&T Corporation', 'Citizens Financial Group', 'CIGNA Corp.', 'Hormel Foods Corp.', 'Tesoro Petroleum Co.', 'Yum! 
Brands Inc', 'AutoNation Inc', 'Danaher Corp.', 'Ecolab Inc.', 'Extra Space Storage', 'Arconic Inc', 'The Travelers Companies Inc.', 'Chubb Limited', 'Abbott Laboratories', 'Sherwin-Williams', 'Wec Energy Group Inc', 'LKQ Corporation', 'Mid-America Apartments', 'The Cooper Companies', 'CVS Health', 'Fluor Corp.', 'PPL Corp.', 'International Business Machines', 'Discovery Communications-A', 'Broadcom', 'Tyson Foods']
267
In [86]:
# create a data frame with only companies that appear on the result_securities_list
df_result = data[data['Security'].isin(result_securities_list)]

# drop companies that fall in the bottom quartile of any metric where a
# low value is undesirable for an investor
categories = ['Price Change', 'ROE', 'Cash Ratio', 'Net Cash Flow', 'Net Income', 'Earnings Per Share', "P/E Ratio", "P/B Ratio"]

# Calculate the 25th percentile for each category
percentiles = {cat: df_result[cat].quantile(0.25) for cat in categories}

# Filter the DataFrame to exclude companies below the 25th percentile in any category
for cat, threshold in percentiles.items():
    df_result = df_result[df_result[cat] >= threshold]

# reindex df_result
df_result = df_result.reset_index(drop=True)

# Display the resulting DataFrame
print(df_result['Security'])
0                Abbott Laboratories
1                         AMETEK Inc
2                      Amphenol Corp
3                           Broadcom
4               Cincinnati Financial
5                        Deere & Co.
6            The Walt Disney Company
7     General Growth Properties Inc.
8                 Hormel Foods Corp.
9                       Invesco Ltd.
10                   Mastercard Inc.
11                          Macerich
12                      Moody's Corp
13                      Mead Johnson
14                        3M Company
15            Norfolk Southern Corp.
16                      PepsiCo Inc.
17                  Procter & Gamble
18                 Progressive Corp.
19       Philip Morris International
20                    PPG Industries
21          Simon Property Group Inc
22                       Tyson Foods
23              Vornado Realty Trust
Name: Security, dtype: object

This is the list of my final actionable recommendations: I would invest in the companies above. These companies appeared in the safest cluster in all four models and have been filtered to exclude any company that fell in the bottom quartile of a category where poor performance is undesirable. Some companies may perform better in particular areas, but these companies all perform relatively well across the board. Closer scrutiny is still required, as I am not a financial advisor.

In [ ]: